Abstract


The variational autoencoder (VAE; Kingma & Welling (2014)) is a recently proposed
generative model pairing a top-down generative network with a bottom-up
recognition network which approximates posterior inference. It typically makes
strong assumptions about posterior inference, for instance that the posterior
distribution is approximately factorial, and that its parameters can be approximated
with nonlinear regression from the observations. As we show empirically, the
VAE objective can lead to overly simplified representations which fail to use the
network's entire modeling capacity. We present the importance weighted autoencoder
(IWAE), a generative model with the same architecture as the VAE, but
which uses a strictly tighter log-likelihood lower bound derived from importance
weighting. In the IWAE, the recognition network uses multiple samples to
approximate the posterior, giving it increased flexibility to model complex posteriors
which do not fit the VAE modeling assumptions. We show empirically that
IWAEs learn richer latent space representations than VAEs, leading to improved
test log-likelihood on density estimation benchmarks.


1 Introduction


In recent years, there has been a renewed focus on learning deep generative models (Hinton et al.,
2006; Salakhutdinov & Hinton, 2009; Gregor et al., 2014; Kingma & Welling, 2014; Rezende et al.,
2014). A common difficulty faced by most approaches is the need to perform posterior inference
during training: the log-likelihood gradients for most latent variable models are defined in terms
of posterior statistics (e.g. Salakhutdinov & Hinton (2009); Neal (1992); Gregor et al. (2014)). One
approach for dealing with this problem is to train a recognition network alongside the generative
model (Dayan et al., 1995). The recognition network aims to predict the posterior distribution over
latent variables given the observations, and can often generate a rough approximation much more
quickly than generic inference algorithms such as MCMC.


The variational autoencoder (VAE; Kingma & Welling (2014); Rezende et al. (2014)) is a recently
proposed generative model which pairs a top-down generative network with a bottom-up recognition
network. Both networks are jointly trained to maximize a variational lower bound on the data
log-likelihood. VAEs have recently been successful at separating style and content (Kingma et al., 2014;
Kulkarni et al., 2015) and at learning to "draw" images in a realistic manner (Gregor et al., 2015).


VAEs make strong assumptions about the posterior distribution. Typically VAE models assume that 
the posterior is approximately factorial, and that its parameters can be predicted from the observables 
through a nonlinear regression. Because they are trained to maximize a variational lower bound 
on the log-likelihood, they are encouraged to learn representations where these assumptions are 
satisfied, i.e. where the posterior is approximately factorial and predictable with a neural network. 
While this effect is beneficial, it comes at a cost: constraining the form of the posterior limits the 
expressive power of the model. This is especially true of the VAE objective, which harshly penalizes 
approximate posterior samples which are unlikely to explain the data, even if the recognition network 
puts much of its probability mass on good explanations. 


In this paper, we introduce the importance weighted autoencoder (IWAE), a generative model which
shares the VAE architecture, but which is trained with a tighter log-likelihood lower bound
derived from importance weighting. The recognition network generates multiple approximate
posterior samples, and their weights are averaged. As the number of samples is increased, the lower
bound approaches the true log-likelihood. The use of multiple samples gives the IWAE additional
flexibility to learn generative models whose posterior distributions do not fit the VAE modeling
assumptions. This approach is related to reweighted wake-sleep (Bornschein & Bengio, 2015), but
the IWAE is trained using a single unified objective. Compared with the VAE, our IWAE is able to
learn richer representations with more latent dimensions, which translates into significantly higher
log-likelihoods on density estimation benchmarks.


2 Background 


In this section, we review the variational autoencoder (VAE) model of Kingma & Welling (2014). In
particular, we describe a generalization of the architecture to multiple stochastic hidden layers. We
note, however, that Kingma & Welling (2014) used a single stochastic hidden layer, and there are
other sensible generalizations to multiple layers, such as the one presented by Rezende et al. (2014).


The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden
layers:

$$p(\mathbf{x}|\theta) = \sum_{\mathbf{h}^1,\ldots,\mathbf{h}^L} p(\mathbf{h}^L|\theta)\, p(\mathbf{h}^{L-1}|\mathbf{h}^L,\theta) \cdots p(\mathbf{x}|\mathbf{h}^1,\theta). \tag{1}$$

Here, $\theta$ is a vector of parameters of the variational autoencoder, and $\mathbf{h} = \{\mathbf{h}^1,\ldots,\mathbf{h}^L\}$ denotes the
stochastic hidden units, or latent variables. The dependence on $\theta$ is often suppressed for clarity. For
convenience, we define $\mathbf{h}^0 = \mathbf{x}$. Each of the terms $p(\mathbf{h}^\ell|\mathbf{h}^{\ell+1})$ may denote a complicated nonlinear
relationship, for instance one computed by a multilayer neural network. However, it is assumed
that sampling and probability evaluation are tractable for each $p(\mathbf{h}^\ell|\mathbf{h}^{\ell+1})$. Note that $L$ denotes
the number of stochastic hidden layers; the deterministic layers are not shown explicitly here. We
assume the recognition model $q(\mathbf{h}|\mathbf{x})$ is defined in terms of an analogous factorization:

$$q(\mathbf{h}|\mathbf{x}) = q(\mathbf{h}^1|\mathbf{x})\, q(\mathbf{h}^2|\mathbf{h}^1) \cdots q(\mathbf{h}^L|\mathbf{h}^{L-1}), \tag{2}$$

where sampling and probability evaluation are tractable for each of the terms in the product.


In this work, we assume the same families of conditional probability distributions as Kingma &
Welling (2014). In particular, the prior $p(\mathbf{h}^L)$ is fixed to be a zero-mean, unit-variance Gaussian.
In general, each of the conditional distributions $p(\mathbf{h}^\ell|\mathbf{h}^{\ell+1},\theta)$ and $q(\mathbf{h}^\ell|\mathbf{h}^{\ell-1},\theta)$ is a Gaussian with
diagonal covariance, where the mean and covariance parameters are computed by a deterministic
feed-forward neural network. For real-valued observations, $p(\mathbf{x}|\mathbf{h}^1)$ is also defined to be such a
Gaussian; for binary observations, it is defined to be a Bernoulli distribution whose mean parameters
are computed by a neural network.

The VAE is trained to maximize a variational lower bound on the log-likelihood, as derived from 
Jensen’s Inequality: 


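$$\log p(\mathbf{x}) = \log \mathbb{E}_{q(\mathbf{h}|\mathbf{x})}\left[\frac{p(\mathbf{x},\mathbf{h})}{q(\mathbf{h}|\mathbf{x})}\right] \geq \mathbb{E}_{q(\mathbf{h}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{h})}{q(\mathbf{h}|\mathbf{x})}\right] = \mathcal{L}(\mathbf{x}). \tag{3}$$
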


Since $\mathcal{L}(\mathbf{x}) = \log p(\mathbf{x}) - D_{\mathrm{KL}}(q(\mathbf{h}|\mathbf{x})\,\|\,p(\mathbf{h}|\mathbf{x}))$, the training procedure is forced to trade off the
data log-likelihood $\log p(\mathbf{x})$ and the KL divergence from the true posterior. This is beneficial, in that
it encourages the model to learn a representation where posterior inference is easy to approximate.
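The decomposition of the bound into log-likelihood minus KL divergence can be checked exactly on a small conjugate example. The sketch below is illustrative only and is not a model from this paper: it assumes a hypothetical one-dimensional model with prior p(h) = N(0, 1), likelihood p(x|h) = N(h, 1), and a Gaussian recognition distribution q(h|x) = N(m, s2), for which the marginal N(0, 2) and the posterior N(x/2, 1/2) are available in closed form.

```python
import math

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian N(x; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def elbo_closed_form(x, m, s2):
    """E_q[log p(x,h) - log q(h|x)] for the assumed toy model:
    p(h) = N(0,1), p(x|h) = N(h,1), q(h|x) = N(m, s2)."""
    e_log_prior = -0.5 * (math.log(2.0 * math.pi) + m ** 2 + s2)
    e_log_lik = -0.5 * (math.log(2.0 * math.pi) + (x - m) ** 2 + s2)
    entropy = 0.5 * math.log(2.0 * math.pi * math.e * s2)
    return e_log_prior + e_log_lik + entropy

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for univariate Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

x, m, s2 = 2.0, 0.3, 0.8
log_px = log_normal(x, 0.0, 2.0)        # exact marginal of the toy model
kl = kl_gauss(m, s2, x / 2.0, 0.5)      # KL from q to the exact posterior
gap = log_px - elbo_closed_form(x, m, s2)
```

In this toy setting `gap` and `kl` agree up to floating-point error, and the gap vanishes when q is set to the exact posterior.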

If one computes the log-likelihood gradient for the recognition network directly from Eqn. 3, the
result is a REINFORCE-like update rule which trains slowly because it does not use the log-likelihood
gradients with respect to latent variables (Dayan et al., 1995; Mnih & Gregor, 2014). Instead,
Kingma & Welling (2014) proposed a reparameterization of the recognition distribution in terms
of auxiliary variables with fixed distributions, such that the samples from the recognition model are
a deterministic function of the inputs and auxiliary variables. While they presented the
reparameterization trick for a variety of distributions, for convenience we discuss the special case of Gaussians,
since that is all we require in this work. (The general reparameterization trick can be used with our
IWAE as well.)

In this paper, the recognition distribution $q(\mathbf{h}^\ell|\mathbf{h}^{\ell-1},\theta)$ always takes the form of a Gaussian
$\mathcal{N}(\mathbf{h}^\ell \,|\, \mu(\mathbf{h}^{\ell-1},\theta), \Sigma(\mathbf{h}^{\ell-1},\theta))$, whose mean and covariance are computed from the states of
the hidden units at the previous layer and the model parameters. This can be alternatively expressed
by first sampling an auxiliary variable $\epsilon^\ell \sim \mathcal{N}(0, I)$, and then applying the deterministic mapping

$$\mathbf{h}^\ell(\epsilon^\ell, \mathbf{h}^{\ell-1}, \theta) = \Sigma(\mathbf{h}^{\ell-1},\theta)^{1/2} \epsilon^\ell + \mu(\mathbf{h}^{\ell-1},\theta). \tag{4}$$

The joint recognition distribution $q(\mathbf{h}|\mathbf{x},\theta)$ over all latent variables can be expressed in terms of
a deterministic mapping $\mathbf{h}(\epsilon, \mathbf{x}, \theta)$, with $\epsilon = (\epsilon^1,\ldots,\epsilon^L)$, by applying Eqn. 4 for each layer in
sequence. Since the distribution of $\epsilon$ does not depend on $\theta$, we can reformulate the gradient of the
bound $\mathcal{L}(\mathbf{x})$ from Eqn. 3 by pushing the gradient operator inside the expectation:


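$$\nabla_\theta\, \mathbb{E}_{\mathbf{h}\sim q(\mathbf{h}|\mathbf{x},\theta)}\left[\log \frac{p(\mathbf{x},\mathbf{h}|\theta)}{q(\mathbf{h}|\mathbf{x},\theta)}\right] = \nabla_\theta\, \mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\left[\log \frac{p(\mathbf{x},\mathbf{h}(\epsilon,\mathbf{x},\theta)|\theta)}{q(\mathbf{h}(\epsilon,\mathbf{x},\theta)|\mathbf{x},\theta)}\right] \tag{5}$$

$$= \mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\left[\nabla_\theta \log \frac{p(\mathbf{x},\mathbf{h}(\epsilon,\mathbf{x},\theta)|\theta)}{q(\mathbf{h}(\epsilon,\mathbf{x},\theta)|\mathbf{x},\theta)}\right]. \tag{6}$$
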


Assuming the mapping $\mathbf{h}$ is represented as a deterministic feed-forward neural network, for a fixed
$\epsilon$, the gradient inside the expectation can be computed using standard backpropagation. In practice,
one approximates the expectation in Eqn. 6 by generating $k$ samples of $\epsilon$ and applying the Monte
Carlo estimator

$$\frac{1}{k}\sum_{i=1}^{k} \nabla_\theta \log w(\mathbf{x}, \mathbf{h}(\epsilon_i,\mathbf{x},\theta), \theta) \tag{7}$$

with $w(\mathbf{x},\mathbf{h},\theta) = p(\mathbf{x},\mathbf{h}|\theta)/q(\mathbf{h}|\mathbf{x},\theta)$. This is an unbiased estimate of $\nabla_\theta \mathcal{L}(\mathbf{x})$. We note that
the VAE update and the basic REINFORCE-like update are both unbiased estimators of the same
gradient, but the VAE update tends to have lower variance in practice because it makes use of the
log-likelihood gradients with respect to the latent variables.
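The reparameterized Monte Carlo estimator of Eqn. 7 can be illustrated on a one-dimensional example. The sketch below is not the network setup of this paper: it assumes a hypothetical toy model p(h) = N(0, 1), p(x|h) = N(h, 1), with recognition distribution q(h|x) = N(m, s^2), and it differentiates with respect to the recognition mean m only. In this toy case the exact gradient of the bound is x - 2m, so the estimate can be compared against it.

```python
import random

def reparam_grad_m(x, m, s, k, rng):
    """Average over k reparameterized samples of d/dm log w(x, h(eps, m, s)).

    Assumed toy model: p(h) = N(0,1), p(x|h) = N(h,1), q(h|x) = N(m, s^2).
    """
    total = 0.0
    for _ in range(k):
        eps = rng.gauss(0.0, 1.0)   # auxiliary noise with a fixed distribution
        h = m + s * eps             # deterministic mapping of (eps, m, s)
        # log q(h(eps)) = const - eps^2 / 2 does not depend on m, so only
        # log p(x, h) contributes: d/dh log p(x, h) = -h + (x - h), dh/dm = 1.
        total += (x - h) - h
    return total / k

rng = random.Random(0)
estimate = reparam_grad_m(x=2.0, m=0.3, s=1.0, k=20000, rng=rng)
exact = 2.0 - 2.0 * 0.3             # analytic gradient of the bound, x - 2m
```

With enough samples, `estimate` settles close to `exact`, reflecting the unbiasedness of the estimator.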


3 Importance Weighted Autoencoder 


The VAE objective of Eqn. 3 heavily penalizes approximate posterior samples which fail to explain
the observations. This places a strong constraint on the model, since the variational assumptions
must be approximately satisfied in order to achieve a good lower bound. In particular, the posterior
distribution must be approximately factorial and predictable with a feed-forward neural network.
This VAE criterion may be too strict; a recognition network which places only a small fraction
(e.g. 20%) of its samples in the region of high posterior probability may still be sufficient for
performing accurate inference. If we lower our standards in this way, this may give us additional
flexibility to train a generative network whose posterior distributions do not fit the VAE
assumptions. This is the motivation behind our proposed algorithm, the Importance Weighted Autoencoder
(IWAE).


Our IWAE uses the same architecture as the VAE, with both a generative network and a recognition
network. The difference is that it is trained to maximize a different lower bound on $\log p(\mathbf{x})$. In
particular, we use the following lower bound, corresponding to the $k$-sample importance weighting
estimate of the log-likelihood:

$$\mathcal{L}_k(\mathbf{x}) = \mathbb{E}_{\mathbf{h}_1,\ldots,\mathbf{h}_k \sim q(\mathbf{h}|\mathbf{x})}\left[\log \frac{1}{k}\sum_{i=1}^{k} \frac{p(\mathbf{x},\mathbf{h}_i)}{q(\mathbf{h}_i|\mathbf{x})}\right]. \tag{8}$$

Here, $\mathbf{h}_1,\ldots,\mathbf{h}_k$ are sampled independently from the recognition model. The term inside the sum
corresponds to the unnormalized importance weights for the joint distribution, which we will denote
as $w_i = p(\mathbf{x},\mathbf{h}_i)/q(\mathbf{h}_i|\mathbf{x})$.


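This is a lower bound on the marginal log-likelihood, as follows from Jensen's Inequality and the
fact that the average importance weights are an unbiased estimator of $p(\mathbf{x})$:

$$\mathcal{L}_k = \mathbb{E}\left[\log \frac{1}{k}\sum_{i=1}^{k} w_i\right] \leq \log \mathbb{E}\left[\frac{1}{k}\sum_{i=1}^{k} w_i\right] = \log p(\mathbf{x}), \tag{9}$$

where the expectations are with respect to $q(\mathbf{h}|\mathbf{x})$.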
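A Monte Carlo estimate of the k-sample bound can be sketched with a numerically stable log-mean-exp of the log importance weights. The example below is illustrative and rests on assumptions not taken from the paper: a hypothetical toy model p(h) = N(0, 1), p(x|h) = N(h, 1), and a deliberately poor recognition distribution equal to the prior, so that log w reduces to log p(x|h). The exact log-likelihood of this toy model is log N(x; 0, 2), which gives a reference value.

```python
import math
import random

def log_mean_exp(values):
    """Stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def log_weight(x, h):
    """log w = log p(x,h) - log q(h|x); with q equal to the prior N(0,1)
    (an assumption of this sketch) this is log p(x|h) = log N(x; h, 1)."""
    return -0.5 * (math.log(2.0 * math.pi) + (x - h) ** 2)

def estimate_bound(x, k, reps, rng):
    """Average of log-mean-exp over `reps` independent draws of k samples."""
    total = 0.0
    for _ in range(reps):
        total += log_mean_exp([log_weight(x, rng.gauss(0.0, 1.0)) for _ in range(k)])
    return total / reps

rng = random.Random(0)
x = 2.0
log_px = -0.5 * (math.log(4.0 * math.pi) + x ** 2 / 2.0)   # exact log N(x; 0, 2)
bounds = {k: estimate_bound(x, k, 2000, rng) for k in (1, 5, 50)}
```

In this toy setting the estimates in `bounds` increase with k and approach `log_px` from below, up to Monte Carlo noise.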


It is perhaps unintuitive that importance weighting would be a reasonable estimator in high
dimensions. Observe, however, that the special case of $k = 1$ is equivalent to the standard VAE objective
shown in Eqn. 3. Using more samples can only improve the tightness of the bound:

Theorem 1. For all $k$, the lower bounds satisfy

$$\log p(\mathbf{x}) \geq \mathcal{L}_{k+1} \geq \mathcal{L}_k. \tag{10}$$

Moreover, if $p(\mathbf{h},\mathbf{x})/q(\mathbf{h}|\mathbf{x})$ is bounded, then $\mathcal{L}_k$ approaches $\log p(\mathbf{x})$ as $k$ goes to infinity.

Proof. See Appendix A. □

The bound $\mathcal{L}_k$ can be estimated using the straightforward Monte Carlo estimator, where we generate
samples from the recognition network and average the importance weights. One might worry about
the variance of this estimator, since importance weighting famously suffers from extremely high
variance in cases where the proposal and target distributions are not a good match. However, as
our estimator is based on the log of the average importance weights, it does not suffer from high
variance. This argument is made more precise in Appendix B.


3.1 Training procedure 

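To train an IWAE with a stochastic gradient based optimizer, we use an unbiased estimate of the
gradient of $\mathcal{L}_k$, defined in Eqn. 8. As with the VAE, we use the reparameterization trick to derive a
low-variance update rule:

$$\nabla_\theta \mathcal{L}_k(\mathbf{x}) = \nabla_\theta\, \mathbb{E}_{\mathbf{h}_1,\ldots,\mathbf{h}_k}\left[\log \frac{1}{k}\sum_{i=1}^{k} w_i\right] = \nabla_\theta\, \mathbb{E}_{\epsilon_1,\ldots,\epsilon_k}\left[\log \frac{1}{k}\sum_{i=1}^{k} w(\mathbf{x},\mathbf{h}(\mathbf{x},\epsilon_i,\theta),\theta)\right] \tag{11}$$

$$= \mathbb{E}_{\epsilon_1,\ldots,\epsilon_k}\left[\nabla_\theta \log \frac{1}{k}\sum_{i=1}^{k} w(\mathbf{x},\mathbf{h}(\mathbf{x},\epsilon_i,\theta),\theta)\right] \tag{12}$$

$$= \mathbb{E}_{\epsilon_1,\ldots,\epsilon_k}\left[\sum_{i=1}^{k} \tilde{w}_i\, \nabla_\theta \log w(\mathbf{x},\mathbf{h}(\mathbf{x},\epsilon_i,\theta),\theta)\right], \tag{13}$$

where $\epsilon_1,\ldots,\epsilon_k$ are the same auxiliary variables as defined in Section 2 for the VAE, $w_i =
w(\mathbf{x},\mathbf{h}(\mathbf{x},\epsilon_i,\theta),\theta)$ are the importance weights expressed as a deterministic function, and $\tilde{w}_i =
w_i / \sum_{i=1}^{k} w_i$ are the normalized importance weights.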

In the context of a gradient-based learning algorithm, we draw $k$ samples from the recognition
network (or, equivalently, $k$ sets of auxiliary variables), and use the Monte Carlo estimate of Eqn. 13:

$$\sum_{i=1}^{k} \tilde{w}_i\, \nabla_\theta \log w(\mathbf{x},\mathbf{h}(\epsilon_i,\mathbf{x},\theta),\theta). \tag{14}$$

In the special case of $k = 1$, the single normalized weight $\tilde{w}_1$ takes the value 1, and one obtains the
VAE update rule.
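The weighting step of this estimator can be sketched in a few lines. The helper names below are hypothetical and the per-sample gradients are taken to be scalars for simplicity; in a real model they would be parameter-shaped arrays produced by backpropagation. The normalized weights are computed as a softmax of the log weights, which avoids overflow when the log weights are large in magnitude.

```python
import math

def normalized_weights(log_w):
    """Normalized importance weights w_i / sum_j w_j from log weights,
    computed stably by subtracting the maximum before exponentiating."""
    m = max(log_w)
    unnorm = [math.exp(v - m) for v in log_w]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def weighted_gradient(log_w, grads):
    """Sum of per-sample gradients of log w, weighted by the normalized
    importance weights (scalar gradients assumed for illustration)."""
    return sum(wt * g for wt, g in zip(normalized_weights(log_w), grads))

# Two samples with weights in ratio 1 : 3 give normalized weights 0.25 and 0.75.
example = weighted_gradient([0.0, math.log(3.0)], [1.0, 2.0])
```

With a single sample the normalized weight is exactly 1 regardless of the log weight, which is the k = 1 case in which the update reduces to the VAE update.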

We unpack this update because it does not quite parallel that of the standard VAE.^1 The gradient of
the log weights decomposes as:

$$\nabla_\theta \log w(\mathbf{x},\mathbf{h}(\mathbf{x},\epsilon_i,\theta),\theta) = \nabla_\theta \log p(\mathbf{x},\mathbf{h}(\mathbf{x},\epsilon_i,\theta)|\theta) - \nabla_\theta \log q(\mathbf{h}(\mathbf{x},\epsilon_i,\theta)|\mathbf{x},\theta). \tag{15}$$

The first term encourages the generative model to assign high probability to each $\mathbf{h}^{\ell-1}$ given $\mathbf{h}^\ell$
(following the convention that $\mathbf{x} = \mathbf{h}^0$). It also encourages the recognition network to adjust the
hidden representations so that the generative network makes better predictions. In the case of a single
stochastic layer (i.e. $L = 1$), the combination of these two effects is equivalent to backpropagation
in a stochastic autoencoder. The second term of this update encourages the recognition network to
have a spread-out distribution over predictions. This update is averaged over the samples with weight
proportional to the importance weights, motivating the name "importance weighted autoencoder."

^1 Kingma & Welling (2014) separated out the KL divergence in the bound of Eqn. 3 in order to achieve a
simpler and lower-variance update. Unfortunately, no analogous trick applies for $k > 1$. In principle, the IWAE
updates may be higher variance for this reason. However, in our experiments, we observed that the performance
of the two update rules was indistinguishable in the case of $k = 1$.

The dominant computational cost in IWAE training is computing the activations and parameter
gradients needed for $\nabla_\theta \log w(\mathbf{x},\mathbf{h}(\mathbf{x},\epsilon_i,\theta),\theta)$. This corresponds to the forward and backward passes
in backpropagation. In the basic IWAE implementation, both passes must be done independently for
each of the $k$ samples. Therefore, the number of operations scales linearly with $k$. In our GPU-based
implementation, the samples are processed in parallel by replicating each training example $k$ times
within a mini-batch.

One can greatly reduce the computational cost by adding another form of stochasticity. Specifically,
only the forward pass is needed to compute the importance weights. The sum in Eqn. 14 can be
stochastically approximated by choosing a single sample $\epsilon_i$ proportional to its normalized weight $\tilde{w}_i$
and then computing $\nabla_\theta \log w(\mathbf{x},\mathbf{h}(\mathbf{x},\epsilon_i,\theta),\theta)$. This method requires $k$ forward passes and one
backward pass per training example. Since the backward pass requires roughly twice as many
add-multiply operations as the forward pass, for large $k$, this trick reduces the number of add-multiply
operations by roughly a factor of 3. This comes at the cost of increased variance in the updates, but
empirically we have found the tradeoff to be favorable.
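Both the cost argument and the sampling step can be made concrete with a small sketch. The function names are hypothetical, the per-sample gradients are scalars for illustration, and the cost model simply counts a backward pass as two forward passes, as stated above. Sampling one index with probability equal to its normalized weight leaves the expected update equal to the full weighted sum.

```python
import random

def cost_ratio(k):
    """Cost of k forward + k backward passes divided by the cost of
    k forward + 1 backward pass, counting a backward pass as 2 forwards."""
    return (k + 2.0 * k) / (k + 2.0)

def sampled_gradient(norm_w, grads, rng):
    """Pick one sample index with probability equal to its normalized
    weight and return that sample's gradient (only one backward pass)."""
    idx = rng.choices(range(len(norm_w)), weights=norm_w, k=1)[0]
    return grads[idx]

rng = random.Random(0)
norm_w, grads = [0.7, 0.2, 0.1], [1.0, 2.0, 3.0]
full = sum(w * g for w, g in zip(norm_w, grads))     # full weighted sum
draws = [sampled_gradient(norm_w, grads, rng) for _ in range(20000)]
average = sum(draws) / len(draws)                    # close to `full`
```

Under this cost model the ratio is 150/52 for k = 50 and tends to 3 as k grows, while each individual sampled update is noisier than the full weighted sum.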


4 Related work 


There are several broad families of approaches to training deep generative models. Some models
are defined in terms of Boltzmann distributions (Smolensky, 1986; Salakhutdinov & Hinton, 2009). This
has the advantage that many of the conditional distributions are tractable, but the inability to sample
from the model or compute the partition function has been a major roadblock (Salakhutdinov &
Murray, 2008). Other models are defined in terms of belief networks (Neal, 1992; Gregor et al.,
2014). These models are tractable to sample from, but the conditional distributions become tangled
due to the explaining away effect.

One strategy for dealing with intractable posterior inference is to train a recognition network
which approximates the posterior. A classic approach was the wake-sleep algorithm, used to train
Helmholtz machines (Dayan et al., 1995). The generative model was trained to model the
conditionals inferred by the recognition net, and the recognition net was trained to explain synthetic data
generated by the generative net. Unfortunately, wake-sleep trained the two networks on different
objective functions. Deep autoregressive networks (Gregor et al., 2014) consisted of deep generative
and recognition networks trained using a single variational lower bound. Neural variational
inference and learning (Mnih & Gregor, 2014) is another algorithm for training recognition networks
which reduces stochasticity in the updates by training a third network to predict reward baselines
in the context of the REINFORCE algorithm (Williams, 1992). Salakhutdinov & Larochelle (2010)
used a recognition network to approximate the posterior distribution in deep Boltzmann machines.


Variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014), as described in detail in
Section 2, are another combination of generative and recognition networks, trained with the same
variational objective as DARN and NVIL. However, in place of REINFORCE, they reduce the
variance of the updates through a clever reparameterization of the random choices. The
reparameterization trick is also known as "backprop through a random number generator" (Williams, 1992).


One factor distinguishing VAEs from the other models described above is that the model is described
in terms of a simple distribution followed by a deterministic mapping, rather than a sequence of
stochastic choices. Similar architectures have been proposed which use different training objectives.
Generative adversarial networks (Goodfellow et al., 2014) train a generative network and a
recognition network which act in opposition: the recognition network attempts to distinguish between
training examples and generated samples, and the generative model tries to generate samples which
fool the recognition network. Maximum mean discrepancy (MMD) networks (Li et al., 2015;
Dziugaite et al., 2015) attempt to generate samples which match a certain set of statistics of the training
data. They can be viewed as a kind of adversarial net where the adversary simply looks at the set of
pre-chosen statistics (Dziugaite et al., 2015). In contrast to VAEs, the training criteria for adversarial
nets and MMD nets are not based on the data log-likelihood.


Other researchers have derived log-probability lower bounds by way of importance sampling. Tang
& Salakhutdinov (2013) and Ba et al. (2015) avoided recognition networks entirely, instead
performing inference using importance sampling from the prior. Gogate et al. (2007) presented a variety
of graphical model inference algorithms based on importance weighting. Reweighted wake-sleep
(RWS) of Bornschein & Bengio (2015) is another recognition network approach which combines
the original wake-sleep algorithm with updates to the generative network equivalent to gradient
ascent on our bound $\mathcal{L}_k$. However, Bornschein & Bengio (2015) interpret this update as following a
biased estimate of $\nabla_\theta \log p(\mathbf{x})$, whereas we interpret it as following an unbiased estimate of $\nabla_\theta \mathcal{L}_k$.
The IWAE also differs from RWS in that the generative and recognition networks are trained to
maximize a single objective, $\mathcal{L}_k$. By contrast, the $q$-wake and sleep steps of RWS do not appear to
be related to $\mathcal{L}_k$. Finally, the IWAE differs from RWS in that it makes use of the reparameterization
trick.


Apart from our approach of using multiple approximate posterior samples, another way to improve
the flexibility of posterior inference is to use a more sophisticated algorithm than importance
sampling. Examples of this approach include normalizing flows (Rezende & Mohamed, 2015) and the
Hamiltonian variational approximation of Salimans et al. (2015).


After the publication of this paper the authors learned that the idea of using an importance weighted 
lower bound for training variational autoencoders has been independently explored by Laurent Dinh 
and Vincent Dumoulin, and preliminary results of their work were presented at the 2014 CIFAR 
NCAP Deep Learning summer school. 


5 Experimental results 

We have compared the generative performance of the VAE and IWAE in terms of their held-out log- 
likelihoods on two density estimation benchmark datasets. We have further investigated a particular 
issue we have observed with VAEs and IWAEs, namely that they learn latent spaces of significantly
lower dimensionality than the modeling capacity they are allowed. We tested whether the IWAE 
training method ameliorates this effect. 


5.1 Evaluation on density estimation 


We evaluated the models on two benchmark datasets: MNIST, a dataset of images of handwritten
digits (LeCun et al., 1998), and Omniglot, a dataset of handwritten characters in a variety of world
alphabets (Lake et al., 2013). In both cases, the observations were binarized 28 x 28 images.^2 We
used the standard splits of MNIST into 60,000 training and 10,000 test examples, and of Omniglot
into 24,345 training and 8,070 test examples.

We trained models with two architectures: 


1. An architecture with a single stochastic layer with 50 units. In between the observations 
and the stochastic layer were two deterministic layers, each with 200 units. 

2. An architecture with two stochastic layers $\mathbf{h}^1$ and $\mathbf{h}^2$, with 100 and 50 units, respectively.
In between $\mathbf{x}$ and $\mathbf{h}^1$ were two deterministic layers with 200 units each. In between $\mathbf{h}^1$ and
$\mathbf{h}^2$ were two deterministic layers with 100 units each.


All deterministic hidden units used the tanh nonlinearity. All stochastic layers used Gaussian
distributions with diagonal covariance, with the exception of the visible layer, which used Bernoulli
distributions. An exp nonlinearity was applied to the predicted variances of the Gaussian
distributions. The network architectures are summarized in Appendix C.


All models were initialized with the heuristic of Glorot & Bengio (2010). For optimization, we used
Adam (Kingma & Ba, 2015) with parameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-4}$ and minibatches of
size 20. The training proceeded for $3^i$ passes over the data with learning rate of $0.001 \cdot 10^{-i/7}$ for
$i = 0 \ldots 7$ (for a total of $\sum_{i=0}^{7} 3^i = 3280$ passes over the data). This learning rate schedule was
chosen based on preliminary experiments training a VAE with one stochastic layer on MNIST.
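The staged schedule can be tabulated with a short sketch. It reads the schedule as 3^i passes over the data at learning rate 0.001 * 10^(-i/7) for stages i = 0, ..., 7; the function name and its arguments are hypothetical and chosen only for illustration.

```python
def staged_schedule(stages=8, base_lr=1e-3):
    """List of (passes, learning_rate) pairs: stage i runs for 3**i passes
    over the data at learning rate base_lr * 10**(-i/7)."""
    return [(3 ** i, base_lr * 10.0 ** (-i / 7.0)) for i in range(stages)]

schedule = staged_schedule()
total_passes = sum(passes for passes, _ in schedule)   # 1 + 3 + ... + 3**7
final_lr = schedule[-1][1]                             # one tenth of base_lr
```

Under this reading the learning rate decays by a factor of ten over the eight stages, and the last stage alone accounts for 2187 of the passes, i.e. about two thirds of training.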


^2 Unfortunately, the generative modeling literature is inconsistent about the method of binarization, and
different choices can lead to considerably different log-likelihood values. We follow the procedure of
Salakhutdinov & Murray (2008): the binary-valued observations are sampled with expectations equal to the real values
in the training set. See Appendix D for an alternative binarization scheme.

                            MNIST                                  OMNIGLOT
                     VAE              IWAE               VAE               IWAE
# stoch.                 active            active             active             active
layers     k     NLL     units     NLL     units     NLL      units     NLL      units

1          1     86.76   19        86.76   19        108.11   28        108.11   28
           5     86.47   20        85.54   22        107.62   28        106.12   34
           50    86.35   20        84.78   25        107.80   28        104.67   41
2          1     85.33   16+5      85.33   16+5      107.58   28+4      107.56   30+5
           5     85.01   17+5      83.89   21+5      106.31   30+5      104.79   38+6
           50    84.78   17+5      82.90   26+7      106.30   30+5      103.38   44+7

Table 1: Results on density estimation and the number of active latent dimensions. For models with two latent
layers, "k1+k2" denotes k1 active units in the first layer and k2 in the second layer. The generative performance
of IWAEs improved with increasing k, while that of VAEs benefitted only slightly. Two-layer models achieved
better generative performance than one-layer models.


For each number of samples $k \in \{1, 5, 50\}$ we trained a VAE with the gradient of $\mathcal{L}(\mathbf{x})$ estimated
as in Eqn. 7 and an IWAE with the gradient estimated as in Eqn. 14. For each $k$, the VAE and the
IWAE were trained for approximately the same length of time.


All log-likelihood values were estimated as the mean of $\mathcal{L}_{5000}$ on the test set. Hence, the reported
values are stochastic lower bounds on the true value, but are likely to be more accurate than the 
lower bounds used for training. 

The log-likelihood results are reported in Table 1. Our VAE results are comparable to those
previously reported in the literature. We observe that training a VAE with $k > 1$ helped only slightly. By
contrast, using multiple samples improved the IWAE results considerably on both datasets. Note that
the two algorithms are identical for $k = 1$, so the results ought to match up to random variability.


On MNIST, IWAE with two stochastic layers and $k = 50$ achieves a log-likelihood of -82.90 on
the permutation-invariant model on this dataset. By comparison, deep belief networks achieved
log-likelihood of approximately -84.55 nats (Murray & Salakhutdinov, 2009), and deep autoregressive
networks achieved log-likelihood of -84.13 nats (Gregor et al., 2014). Gregor et al. (2015), who
exploited spatial structure, achieved a log-likelihood of -80.97. We did not find overfitting to be a
serious issue for either the VAE or the IWAE: in both cases, the training log-likelihood was 0.62 to
0.79 nats higher than the test log-likelihood. We present samples from our models in Appendix E.


For the OMNIGLOT dataset, the best performing IWAE has log-likelihood of -103.38 nats, which is
slightly worse than the log-likelihood of -100.46 nats achieved by a Restricted Boltzmann Machine
with 500 hidden units trained with persistent contrastive divergence (Burda et al., 2015). RBMs
trained with centering or FANG methods achieve a similar performance of around -100 nats (Grosse
& Salakhudinov, 2015). The training log-likelihood for the models we trained was 2.39 to 2.65 nats
higher than the test log-likelihood.


5.2 Latent space representation 

We have observed that both VAEs and IWAEs tend to learn latent representations with effective 
dimensions far below their capacity. Our next set of experiments aimed to quantify this effect and 
determine whether the IWAE objective ameliorates this effect. 

If a latent dimension encodes useful information about the data, we would expect its distribution 
to change depending on the observations. Based on this intuition, we measured activity of a latent 
dimension $u$ using the statistic $A_u = \mathrm{Cov}_{\mathbf{x}}\left(\mathbb{E}_{u \sim q(u|\mathbf{x})}[u]\right)$. We defined the dimension $u$ to be active
if $A_u > 10^{-2}$. We have observed two pieces of evidence that this criterion is both well-defined and
meaningful: 


1. The distribution of Au for a trained model consisted of two widely separated modes, as 
shown in Appendix C. 
                       First stage                             Second stage
                trained as     NLL     active units     trained as     NLL     active units

Experiment 1    VAE            86.76   19               IWAE, k = 50   84.88   22
Experiment 2    IWAE, k = 50   84.78   25               VAE            86.02   23

Table 2: Results of continuing to train a VAE model with the IWAE objective, and vice versa. Training the
VAE with the IWAE objective increased the latent dimension and test log-likelihood, while training the IWAE
with the VAE objective had the opposite effect.



2. To confirm that the inactive dimensions were indeed insignificant to the predictions, we 
evaluated all models with the inactive dimensions removed. In all cases, this changed the 
test log-likelihood by less than 0.06 nats. 
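The activity criterion can be sketched as follows. The input layout is an assumption made for illustration: `posterior_means[n][u]` holds the recognition network's posterior mean of latent dimension u for observation n, and the statistic for each dimension is the variance of those means across observations. The threshold is left as a parameter, and its default value here is an assumption of this sketch.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def active_units(posterior_means, threshold=1e-2):
    """Indices u whose posterior mean varies across observations by more
    than `threshold` (the default threshold is an assumed value)."""
    n_units = len(posterior_means[0])
    return [u for u in range(n_units)
            if variance([row[u] for row in posterior_means]) > threshold]

# Dimension 0 barely responds to the input; dimension 1 clearly does.
means = [[0.000, 1.0],
         [0.001, -1.0],
         [-0.001, 0.5]]
active = active_units(means)
```

Because the statistic was observed to fall into two widely separated modes, the exact threshold should matter little as long as it lies between them.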

In Table 1 we report the numbers of active units for all conditions. In all conditions, the number of
active dimensions was far less than the total number of dimensions. Adding more latent dimensions
did not increase the number of active dimensions. Interestingly, in the two-layer models, the second
layer used very little of its modeling capacity: the number of active dimensions was always less
than 10. In all cases with $k > 1$, the IWAE learned more latent dimensions than the VAE. Since this
coincided with higher log-likelihood values, we speculate that a larger number of active dimensions
reflects a richer latent representation.

Superficially, the phenomenon of inactive dimensions appears similar to the problem of “units dying 
out” in neural networks and latent variable models, an effect which is often ascribed to difficulties 
in optimization. For example, if a unit is inactive, it may never receive a meaningful gradient signal 
because of a plateau in the optimization landscape. In such cases, the problem may be avoided 
through a better initialization. To determine whether the inactive units resulted from an optimization 
issue or a modeling issue, we took the best-performing VAE and IWAE models from Table 1 and
continued training the VAE model using the IWAE objective and vice versa. In both cases, the model
was trained for an additional $3^7$ passes over the data with a learning rate of $10^{-4}$.

The results are shown in Table 2. We found that continuing to train the VAE with the IWAE objective
increased the number of active dimensions and the test log-likelihood, while continuing to train
the IWAE with the VAE objective did the opposite. The fact that training with the VAE objective
actively reduces both the number of active dimensions and the log-likelihood strongly suggests that
inactivation of the latent dimensions is driven by the objective functions rather than by optimization
issues. On the other hand, optimization also appears to play a role, as the results in Table 2 are not
quite identical to those in Table 1.

6 Conclusion 

In this paper, we presented the importance weighted autoencoder, a variant on the VAE trained by 
maximizing a tighter log-likelihood lower bound derived from importance weighting. We showed 
empirically that IWAEs learn richer latent representations and achieve better generative performance 
than VAEs with equivalent architectures and training time. We believe this method may improve the 
flexibility of other generative models currently trained with the VAE objective. 
Appendix A


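Proof of Theorem 1. We need to show the following facts about the log-likelihood lower bound $\mathcal{L}_k$:

1. $\log p(\mathbf{x}) \geq \mathcal{L}_k$,

2. $\mathcal{L}_k \geq \mathcal{L}_m$ for $k \geq m$,

3. $\log p(\mathbf{x}) = \lim_{k\to\infty} \mathcal{L}_k$, assuming $p(\mathbf{h},\mathbf{x})/q(\mathbf{h}|\mathbf{x})$ is bounded.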

We prove each in turn: 

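1. It follows from Jensen's inequality that

$$\mathcal{L}_k = \mathbb{E}\left[\log \frac{1}{k}\sum_{i=1}^{k} \frac{p(\mathbf{x},\mathbf{h}_i)}{q(\mathbf{h}_i|\mathbf{x})}\right] \leq \log \mathbb{E}\left[\frac{1}{k}\sum_{i=1}^{k} \frac{p(\mathbf{x},\mathbf{h}_i)}{q(\mathbf{h}_i|\mathbf{x})}\right] = \log p(\mathbf{x}). \tag{16}$$
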


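2. Let $I \subset \{1,\ldots,k\}$ with $|I| = m$ be a uniformly distributed subset of distinct indices from
$\{1,\ldots,k\}$. We will use the following simple observation:
$\mathbb{E}_{I=\{i_1,\ldots,i_m\}}\left[\frac{a_{i_1}+\cdots+a_{i_m}}{m}\right] = \frac{a_1+\cdots+a_k}{k}$ for any sequence of numbers $a_1,\ldots,a_k$.

Using this observation and Jensen's inequality, we get

$$\mathcal{L}_k = \mathbb{E}_{\mathbf{h}_1,\ldots,\mathbf{h}_k}\left[\log \frac{1}{k}\sum_{i=1}^{k} \frac{p(\mathbf{x},\mathbf{h}_i)}{q(\mathbf{h}_i|\mathbf{x})}\right] \tag{17}$$

$$= \mathbb{E}_{\mathbf{h}_1,\ldots,\mathbf{h}_k}\left[\log \mathbb{E}_{I=\{i_1,\ldots,i_m\}}\left[\frac{1}{m}\sum_{j=1}^{m} \frac{p(\mathbf{x},\mathbf{h}_{i_j})}{q(\mathbf{h}_{i_j}|\mathbf{x})}\right]\right] \tag{18}$$

$$\geq \mathbb{E}_{\mathbf{h}_1,\ldots,\mathbf{h}_k}\left[\mathbb{E}_{I=\{i_1,\ldots,i_m\}}\left[\log \frac{1}{m}\sum_{j=1}^{m} \frac{p(\mathbf{x},\mathbf{h}_{i_j})}{q(\mathbf{h}_{i_j}|\mathbf{x})}\right]\right] \tag{19}$$

$$= \mathbb{E}_{\mathbf{h}_1,\ldots,\mathbf{h}_m}\left[\log \frac{1}{m}\sum_{i=1}^{m} \frac{p(\mathbf{x},\mathbf{h}_i)}{q(\mathbf{h}_i|\mathbf{x})}\right] = \mathcal{L}_m. \tag{20}$$
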


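3. Consider the random variable $M_k = \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i|x)}$. If $p(h, x)/q(h|x)$ is bounded, then
it follows from the strong law of large numbers that $M_k$ converges to $\mathbb{E}_{q(h_i|x)}\left[\frac{p(x, h_i)}{q(h_i|x)}\right] = p(x)$ almost surely. Hence $\mathcal{L}_k = \mathbb{E}[\log M_k]$ converges to $\log p(x)$ as $k \to \infty$.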
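
The three facts can be checked numerically on a toy model where $\log p(x)$ is known in closed form. The sketch below is not from the paper: it assumes a scalar model $h \sim \mathcal{N}(0, 1)$, $x|h \sim \mathcal{N}(h, 1)$, so that $p(x) = \mathcal{N}(x; 0, 2)$, and a deliberately mismatched proposal $q(h|x) = \mathcal{N}(x/2, 1)$ for which $p(x, h)/q(h|x)$ is bounded. The names `log_normal` and `estimate_bound`, the value of `x`, and the sample sizes are illustrative choices.

```python
import numpy as np

def log_normal(z, mean, var):
    """Log density of a univariate Gaussian N(z; mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var)

def estimate_bound(x, k, n_rep, rng):
    """Monte Carlo estimate of L_k = E[log (1/k) sum_i p(x, h_i) / q(h_i | x)]."""
    # Toy proposal (an assumption): q(h | x) = N(x / 2, 1), which is wider
    # than the true posterior N(x / 2, 1 / 2) of the toy model.
    h = rng.normal(x / 2.0, 1.0, size=(n_rep, k))
    log_w = (log_normal(h, 0.0, 1.0)          # log p(h)
             + log_normal(x, h, 1.0)          # log p(x | h)
             - log_normal(h, x / 2.0, 1.0))   # log q(h | x)
    # Numerically stable log of the average weight within each replicate.
    m = log_w.max(axis=1, keepdims=True)
    log_mean_w = m[:, 0] + np.log(np.exp(log_w - m).mean(axis=1))
    # Average over replicates to approximate the outer expectation.
    return log_mean_w.mean()

rng = np.random.default_rng(0)
x = 1.3
log_px = log_normal(x, 0.0, 2.0)  # exact marginal of the toy model: x ~ N(0, 2)
bounds = {k: estimate_bound(x, k, n_rep=20000, rng=rng) for k in (1, 5, 50)}
for k, value in bounds.items():
    print(f"k={k:3d}  L_k ~ {value:.4f}  gap ~ {log_px - value:.4f}")
```

Under these assumptions the $k = 1$ gap equals the KL divergence from $q(h|x)$ to the true posterior, about 0.153 nats, and the estimated gap shrinks as $k$ grows, up to Monte Carlo error.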
Appendix B


It is well known that the variance of an unnormalized importance sampling based estimator can 
be extremely large, or even infinite, if the proposal distribution is not well matched to the target 
distribution. Here we argue that the Monte Carlo estimator of $\mathcal{L}_k$, described in the main text, does not
suffer from large variance. More precisely, we bound the mean absolute deviation (MAD). While 
this does not directly bound the variance, it would be surprising if an estimator had small MAD yet 
extremely large variance. 

Suppose we have a strictly positive unbiased estimator $\hat{Z}$ of a positive quantity $Z$, and we wish to
use $\log \hat{Z}$ as an estimator of $\log Z$. By Jensen's inequality, this is a biased estimator, i.e. $\mathbb{E}[\log \hat{Z}] \le \log Z$.
Denote the bias as $\delta = \log Z - \mathbb{E}[\log \hat{Z}]$. We start with the observation that $\log \hat{Z}$ is unlikely
to overestimate $\log Z$ by very much, as can be shown with Markov's inequality:

\begin{equation}
\Pr(\log \hat{Z} > \log Z + b) \le e^{-b}. \tag{21}
\end{equation}

Let $(X)_+$ denote $\max(X, 0)$. We now use the above facts to bound the MAD:


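\begin{align}
\mathbb{E}\left[\left|\log \hat{Z} - \mathbb{E}[\log \hat{Z}]\right|\right]
&= 2\,\mathbb{E}\left[\left(\log \hat{Z} - \mathbb{E}[\log \hat{Z}]\right)_+\right] \tag{22} \\
&= 2\,\mathbb{E}\left[\left(\log \hat{Z} - \log Z + \log Z - \mathbb{E}[\log \hat{Z}]\right)_+\right] \tag{23} \\
&\le 2\,\mathbb{E}\left[\left(\log \hat{Z} - \log Z\right)_+ + \left(\log Z - \mathbb{E}[\log \hat{Z}]\right)_+\right] \tag{24} \\
&= 2\,\mathbb{E}\left[\left(\log \hat{Z} - \log Z\right)_+\right] + 2\delta \tag{25} \\
&= 2 \int_0^\infty \Pr\left(\log \hat{Z} - \log Z > t\right) dt + 2\delta \tag{26} \\
&\le 2 \int_0^\infty e^{-t}\, dt + 2\delta \tag{27} \\
&= 2 + 2\delta. \tag{28}
\end{align}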


Here, (22) is a general formula for the MAD, (26) uses the formula $\mathbb{E}[Y] = \int_0^\infty \Pr(Y > t)\, dt$ for
a nonnegative random variable $Y$, and (27) applies the bound (21). Hence, the MAD is bounded by
$2 + 2\delta$. In the context of IWAE, $\delta$ corresponds to the gap between $\mathcal{L}_k$ and $\log p(x)$.
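
The bound can be illustrated with a simulation. The sketch below is an illustration rather than part of the paper's experiments: it assumes a log-normal estimator $\hat{Z} = \exp(\sigma g - \sigma^2/2)$ with $g \sim \mathcal{N}(0, 1)$, which is strictly positive and unbiased for $Z = 1$. Its variance is $e^{\sigma^2} - 1$, which is very large for the hypothetical choice $\sigma = 3$, while the MAD of $\log \hat{Z}$ stays well below $2 + 2\delta$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 3.0                  # hypothetical spread of the log-estimator
n_samples = 200_000

# log of a log-normal estimator with E[Z_hat] = Z = 1 (an assumed toy estimator).
g = rng.standard_normal(n_samples)
log_z_hat = sigma * g - 0.5 * sigma ** 2
log_z = 0.0

# Bias of log Z_hat as an estimator of log Z (exactly sigma^2 / 2 for this toy case).
delta = log_z - log_z_hat.mean()

# Mean absolute deviation of log Z_hat around its own mean.
mad = np.abs(log_z_hat - log_z_hat.mean()).mean()
bound = 2.0 + 2.0 * delta

# Empirical check of the Markov step (21) for one value of b.
b = 1.0
tail = (log_z_hat > log_z + b).mean()

print(f"delta ~ {delta:.3f}, MAD ~ {mad:.3f}, bound 2 + 2*delta ~ {bound:.3f}")
print(f"Pr(log Z_hat > log Z + {b}) ~ {tail:.4f} <= exp(-b) = {np.exp(-b):.4f}")
```

For this toy estimator the MAD is $\sigma\sqrt{2/\pi}$ in closed form, so the bound is loose here; the point is only that it holds even when the variance of $\hat{Z}$ itself is enormous.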


Appendix C 
Network architectures 

Here is a summary of the network architectures used in the experiments: 
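\begin{align*}
q(h^1|x) &= \mathcal{N}(h^1 \,|\, \mu_{q,1}, \mathrm{diag}(\sigma_{q,1})) \\
q(h^2|h^1) &= \mathcal{N}(h^2 \,|\, \mu_{q,2}, \mathrm{diag}(\sigma_{q,2})) \\
p(h^1|h^2) &= \mathcal{N}(h^1 \,|\, \mu_{p,1}, \mathrm{diag}(\sigma_{p,1})) \\
p(x|h^1) &= \mathrm{Bernoulli}(x \,|\, \mu_{p,0})
\end{align*}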
Distribution of activity statistic

In Section 5.2 we defined the activity statistic $A_u = \mathrm{Cov}_x\left(\mathbb{E}_{u \sim q(u|x)}[u]\right)$, and chose a threshold
of $10^{-2}$ for determining if a unit is active. One justification for this is that the distribution of this
statistic consisted of two widely separated modes in every case we looked at. Here is the histogram
of $\log A_u$ for a VAE with one stochastic layer:

[Figure: histogram of $\log A_u$]
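
As an illustration of how the statistic could be computed, the sketch below takes a matrix of posterior means (one row per example, one column per latent unit) and reports the variance of each column across examples. It is a minimal sketch on synthetic data, not the authors' code: the function name `active_units`, the default threshold argument, and the synthetic "informative" and "collapsed" units are assumptions.

```python
import numpy as np

def active_units(posterior_means, threshold=1e-2):
    """Return (A, mask), where A[u] is the variance across examples of the
    posterior mean of unit u, and mask[u] is True when A[u] > threshold."""
    a = posterior_means.var(axis=0)
    return a, a > threshold

rng = np.random.default_rng(0)
n_examples = 1000

# Synthetic stand-in for recognition-network outputs E_{u ~ q(u|x)}[u]:
# three units whose posterior mean varies with x, two that barely move.
informative = rng.normal(0.0, 1.0, size=(n_examples, 3))
collapsed = rng.normal(0.0, 1e-3, size=(n_examples, 2))
means = np.hstack([informative, collapsed])

a, mask = active_units(means)
print("log A_u:", np.log(a).round(2))
print("active units:", int(mask.sum()))
```

On data like this the values of $\log A_u$ fall into two well separated groups, so the count of active units is insensitive to the exact threshold.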



Visualization of posterior distributions 


We show some examples of true and approximate posteriors for VAE and IWAE models trained with 
two latent dimensions. Heat maps show true posterior distributions for 6 training examples, and the 
pictures in the bottom row show the examples and their reconstruction from samples from $q(h|x)$.
Left: VAE. Middle: IWAE, with $k = 5$. Right: IWAE, with $k = 50$. The IWAE prefers less regular
posteriors and more spread out posterior predictions. 
Appendix D


Results for a fixed MNIST binarization 


Several previous works have used a fixed binarization of the MNIST dataset defined by Larochelle & Murray
(2011). We repeated our experiments training the models on the 50000 examples from the training
dataset, and evaluating them on the 10000 examples from the test dataset. Otherwise we used the
same training procedure and hyperparameters as in the experiments in the main part of the paper.
The results in Table 3 indicate that the conclusions about the relative merits of VAEs and IWAEs are
unchanged in the new experimental setup. In this setup we noticed significantly larger amounts of
overfitting.

                         VAE                  IWAE
# stoch.                       active               active
layers      k      NLL         units      NLL       units

1           1      88.71       19         88.71     19
            5      88.83       19         87.63     22
            50     89.05       20         87.10     24

2           1      88.08       16+5       88.08     16+5
            5      87.63       17+5       86.17     21+5
            50     87.86       17+6       85.32     24+7


Table 3: Results on density estimation and the number of active latent dimensions on the fixed binarization
MNIST dataset. For models with two latent layers, "$k_1 + k_2$" denotes $k_1$ active units in the first layer and $k_2$
in the second layer. The generative performance of IWAEs improved with increasing $k$, while that of VAEs
benefitted only slightly. Two-layer models achieved better generative performance than one-layer models.


Appendix E 

Samples 
Table 4: Random samples from VAE (left column) and IWAE with $k = 50$ (right column) models. Row 1:
models with one stochastic layer. Row 2: models with two stochastic layers. Samples are represented as the
means of the corresponding Bernoulli distributions. 